Skip to content

Fix late selective_channel retry after EndRPC#3359

Open
hjwsm1989 wants to merge 1 commit into
apache:masterfrom
hjwsm1989:codex/selective-channel-late-subdone-fix
Open

Fix late selective_channel retry after EndRPC#3359
hjwsm1989 wants to merge 1 commit into
apache:masterfrom
hjwsm1989:codex/selective-channel-late-subdone-fix

Conversation

@hjwsm1989

@hjwsm1989 hjwsm1989 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: #3358

SelectiveChannel has a use-after-teardown race: a late SubDone::Run() can
re-enter the retry/backup path after the main RPC has already run EndRPC().

EndRPC() resets the controller's load balancer (_lb), but a sub-call that
finishes slightly later still flows through:

SubDone::Run
  -> Controller::OnVersionedRPCReturned()
     -> Controller::IssueRPC()
        -> schan::Sender::IssueRPC()   // reads _main_cntl->_lb, now null

so it retries on partially torn-down state — the null-balancer crash in #3358
(SharedLoadBalancer::SelectServer(this=0x0)). This is a distinct race from #3294.

Trigger: backup_request_ms > 0, max_retry > 0, and several sub-RPCs ending
almost simultaneously (timeout / connection failure) so a sub-call's callback
lands just after the main RPC's EndRPC().

What is changed and the side effects?

Changed:

  • EndRPC() sets a new FLAGS_ENDING_RPC flag at its very start.
  • OnVersionedRPCReturned() checks is_ending_rpc() and ignores late SubDone
    callbacks instead of re-entering retry/backup.
  • schan::Sender::IssueRPC() keeps a defensive null check on the balancer as a
    last-resort guard at the selective_channel boundary.

Test:
Added a regression test using SelectiveChannel with backup request + retry that
lets the main RPC time out first and delayed sub-calls finish afterward —
reproducing the late-callback window and asserting it no longer re-enters
retry/backup. Re-ran the related selective / backup-request tests.

Side effects:

  • Performance: none. The guard is a single flag check on the already-cold
    teardown path; the hot path is unchanged.
  • Breaking backward compatibility: none.

Cluster validation

Also validated on a 3-node, production-like cluster issuing periodic
multi-address meta RPCs over SelectiveChannel (backup_request_ms set,
max_retry=3, short timeout). With the target on a remote node, one-way egress
delay (tc netem delay 150ms) was injected while the per-RPC timeout stayed at
50ms, so the main RPC enters EndRPC() while backup/retry sub-RPCs are still in
flight and their SubDone::Run() lands afterward.

The fix is silent by design (a late SubDone just hits the is_ending_rpc()
guard and is dropped), so to confirm the window was actually exercised I
temporarily added a LOG(WARNING) at the guard. Over a 4-minute injection it
fired 2137 times with 0 crashes and 0 new coredumps, and the cluster
recovered automatically once the delay was cleared. Without this fix each of
those late SubDones would have re-entered retry/backup and dereferenced the
already-reset _lb. (The log was temporary instrumentation, not part of this PR.)

Check List:

  • Changes are compilable.
  • Added a regression test for the new behavior.
  • Follows the Contributor Covenant Code of Conduct.

Fixes: #3358

@hjwsm1989 hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from c2c7d03 to e27a7e9 Compare June 25, 2026 06:01
@hjwsm1989

Copy link
Copy Markdown
Contributor Author

@chenBright hello, please help review

_main_cntl->SetFailed(ECANCELED,
"SelectiveChannel balancer is unavailable");
return -1;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem seems unresolved. The process still crashes even if the balancer is deleted by main_cntl after the if statement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenBright thanks for review, The previous null check was not enough because it did not keep the balancer alive after _lb.get().

I updated the patch so Sender holds its own intrusive_ptr<SharedLoadBalancer> captured from SelectiveChannel at creation time, and IssueRPC() uses that retained reference. This prevents main_cntl->_lb.reset() in EndRPC() from freeing the balancer while IssueRPC() is still using it.

The FLAGS_ENDING_RPC guard is kept to prevent late SubDone from re-entering retry / backup after the main RPC starts ending.

@hjwsm1989 hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from e27a7e9 to 531f2d7 Compare June 27, 2026 09:41
@chenBright chenBright requested a review from Copilot June 27, 2026 09:53

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a SelectiveChannel race where late SubDone::Run() callbacks could re-enter retry/backup logic after the main RPC has entered Controller::EndRPC(), potentially operating on partially torn-down controller state (including a previously observed null balancer path).

Changes:

  • Introduces an “ending RPC” controller flag set at the beginning of Controller::EndRPC() and ignores late OnVersionedRPCReturned() callbacks once teardown has begun.
  • Makes schan::Sender hold its own reference to the load balancer and adds a defensive null check before selecting a sub-channel.
  • Adds a regression test that forces the main RPC to time out while delayed sub-calls finish later, validating late-callback behavior under retry/backup configuration.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
test/brpc_channel_unittest.cpp Adds a regression test exercising late sub-call completion after main timeout with retry + backup enabled.
src/brpc/selective_channel.cpp Passes the balancer into schan::Sender and adds a defensive null check before using it.
src/brpc/controller.h Adds FLAGS_ENDING_RPC and a helper accessor to expose “ending RPC” state.
src/brpc/controller.cpp Sets the “ending RPC” flag early in EndRPC() and short-circuits late callbacks in OnVersionedRPCReturned().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 580 to +584
if (!initialized()) {
cntl->SetFailed(EINVAL, "SelectiveChannel=%p is not initialized yet",
this);
}
schan::Sender* sndr = new schan::Sender(cntl, request, response, user_done);
schan::Sender* sndr =

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. SelectiveChannel::CallMethod() now returns immediately when the channel is not initialized, and runs done in-place if provided, matching the behavior of PartitionChannel.

This avoids allocating Sender or dispatching into the inner channel with an unavailable balancer. I also added a regression test covering both sync and async uninitialized calls.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since lb already belongs to a Sender, then in Sender::IssueRPC, _main_cntl->_lb.get() should be changed to _lb.get(), otherwise this part looks a bit confusing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanglimingcn This has already been addressed in the latest patch. Sender::IssueRPC() now uses the Sender-owned _lb.get() and calls SelectChannel() through the local balancer pointer, instead of reading _main_cntl->_lb.get() directly.

@hjwsm1989 hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch 2 times, most recently from 4217a8a to ee4d47a Compare June 29, 2026 05:43
@hjwsm1989

Copy link
Copy Markdown
Contributor Author

@chenBright @yanglimingcn comments updated,thanks for another review

@chenBright chenBright requested a review from Copilot June 30, 2026 02:13

This comment was marked as low quality.

Comment thread src/brpc/selective_channel.cpp Outdated
Ignore late selective_channel SubDone callbacks once the main RPC enters EndRPC, and let Sender hold the selective balancer while issuing sub-RPCs.

Return early when SelectiveChannel is called before Init, preserving the intended EINVAL result.

Add regression tests covering timeout plus delayed sub-call completion and uninitialized SelectiveChannel calls.
@hjwsm1989 hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from ee4d47a to 2587a7f Compare June 30, 2026 02:58

@chenBright chenBright left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

selective_channel: late SubDone may re-enter retry/backup after _lb.reset() and crash on null ChannelBalancer

4 participants